Conclusion 1: There are no null or missing values in any of the columns.
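A minimal sketch of the check behind this conclusion, assuming the dataset is loaded from a file named `parkinsons.data` (a hypothetical name) into a DataFrame `df`:

```python
import pandas as pd

df = pd.read_csv("parkinsons.data")   # hypothetical file name
print(df.isnull().sum())              # per-column count of missing values
print(df.isnull().values.any())       # should print False if the claim holds
```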

Observations

Multi-Dimensional Voice Program (MDVP) measures are studied here.

Some ideas and concepts are drawn from these papers:

  1. https://cyberleninka.org/article/n/389565.pdf
  2. https://www.scielo.br/scielo.php?pid=S1516-18462015000401341&script=sci_arttext&tlng=en
  3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5434464/

MDVP:Fo (Average Vocal Fundamental Frequency):

  1. The mean and median approximately match, which suggests the data is reasonably distributed apart from some outliers. However, the maximum value in the summary table lies more than three standard deviations from the mean, so the column may need some conditioning informed by its correlations (possibly a fit transformation). A quick check of that three-sigma claim is sketched after this list.
  2. The average fundamental frequency lies, as expected, between the average maximum and the average minimum, which indicates the data is plausible and does not need much further analysis.
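A rough sketch of the three-standard-deviation check from point 1, reusing `df` from above and assuming the column is named `MDVP:Fo(Hz)` as in the UCI Parkinson's dataset:

```python
# z-distance of the column maximum from its mean; a value above 3
# supports the "beyond three standard deviations" observation.
col = df["MDVP:Fo(Hz)"]
z_of_max = (col.max() - col.mean()) / col.std()
print(f"max lies {z_of_max:.2f} standard deviations above the mean")
```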

MDVP:Fhi (Maximum Vocal Fundamental Frequency):

  1. The mean and median do not exactly coincide, but they are not far apart and sit within one standard deviation of each other.
  2. The maximum value, however, is well beyond double the 75th percentile. So there are some severe outliers, implying great variation within the data. Since this column records the maximum high frequency reported, it naturally varies with the entry (the patient in question). The minimum is within one standard deviation of the 25th percentile, suggesting the data is probably skewed to the right.

MDVP:Flo (Minimum Vocal Fundamental Frequency):

  1. The standard deviation here exceeds that of the Average Vocal Fundamental Frequency (MDVP:Fo), implying that more subjects sit towards the lower end of the frequency range, probably because more low-frequency ranges are included. This may suggest the patients are more often men than women, since women typically have higher fundamental frequencies.
  2. The maximum value for this column follows much the same pattern as the previous column: it is roughly double the 75th percentile, indicating some strong outliers.

MDVP:Jitter (in %):

  1. This is the most fundamental measure of how far the voice deviates from its intended pitch. With that in mind, the distribution appears fairly even, because the mean approximately coincides with the median.
  2. The minimum value is within two standard deviations of the 25th percentile, but the maximum is several standard deviations out, indicating huge outliers. Whether this matters must be checked against the correlations once the heatmap is plotted.
  3. If the maximum value correlates strongly with other quantities, especially the subject's status, that will be interesting to see, to say the least.

MDVP:Jitter(Abs) (Absolute Jitter):

  1. The maximum and minimum values are all over the place. Although the minimum lies within the 25th percentile of the data, the maximum is beyond even five standard deviations. That means there are extreme outliers, and there is also a skew, implied by the mean and median not coinciding.

MDVP:RAP (Relative Average Perturbation):

  1. This is a relative quantity (a ratio), so it cannot be large. Since the 25th and 75th percentiles lie within one standard deviation of each other, the data is densely packed here, implying the values cluster closely around the average with little variation.
  2. The mean sits closer to the 75th percentile, implying the data is skewed to the right.

MDVP:PPQ (Pitch Perturbation Quotient):

  1. Exactly the same conclusions can be drawn as for the previous column, MDVP:RAP.
  2. That also means the two may be interdependent, or at least very strongly correlated. This needs to be analyzed properly.

Jitter:DDP (difference of differences between jitter cycles):

  1. The minimum value is within one standard deviation of the mean, while the maximum is far above it. So there are some severe outliers.
  2. The mean and median are almost identical, which implies a fairly even density distribution. Still, the data appears to need proper conditioning to actually extract the outliers.

MDVP:Shimmer(dB) :

  1. The mean lies somewhere between the 50th and 75th percentiles, hinting at some interesting relationships with other quantities. Several other columns show a very similar distribution; it remains to be seen how strongly these similarly distributed columns correlate with each other.
  2. Once again, the maximum value is far beyond two standard deviations above the 75th percentile, implying some severe outliers.

Shimmer:APQ3 (Three-Point Amplitude Perturbation Quotient):

  1. The mean more or less coincides with the median, while both the minimum and the maximum lie well outside the standard-deviation estimates. That implies the data is widely scattered yet may also be concentrated around one particular point; how to interpret this remains to be seen.

Shimmer:APQ5 (Five-Point Amplitude Perturbation Quotient):

  1. The minimum value is more than three standard deviations from the 25th percentile, so it sits at the extreme low end. But this remains to be confirmed in the correlation plot.

MDVP:APQ (Amplitude Perturbation Quotient):

  1. The mean is again pulled towards the 75th percentile, so there may be a slight skew.
  2. The maximum is many standard deviations above the 75th percentile, indicating a lot of outliers.

Shimmer:DDA :

  1. The mean and median are more or less close to each other, implying the data might be evenly distributed.
  2. It shares the same qualities as the column just above.

NHR :

  1. It shares the same characteristics as the fields above.

HNR :

  1. The mean coincides with the median, whereas the minimum and maximum values are both many times the quartile values.

Status :

  1. This is our target column, which is categorical. The mean is around 0.75, which means that a lot of the entries here are positively diagnosed with Parkinson's Disease.

RPDE (Recurrence Period Density Entropy) & DFA (Detrended Fluctuation Analysis):

  1. These columns seem to be the best behaved, with fairly normal-looking distributions.

The rest of the columns are nonlinear measures. We can't accurately guess how they will behave, because fitting them may require something else entirely. They need to be evaluated before anything is done with them.

Let us now try to estimate, on a preliminary basis, which columns seem to have a strong correlation with the target column, viz. status, which is also a categorical variable.
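A sketch of that preliminary estimate, assuming the target column is named `status` and the non-numeric identifier column is named `name` (as in the UCI dataset):

```python
# Absolute Pearson correlation of every numeric feature with "status",
# strongest first; the non-numeric "name" column is excluded if present.
corr_with_status = (
    df.drop(columns=["name"], errors="ignore")
      .corr(numeric_only=True)["status"]
      .drop("status")
)
print(corr_with_status.abs().sort_values(ascending=False))
```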

Exploratory Data Analysis

Observations and Notes

  1. It looks like HNR has no correlation with any quantity whatsoever, so we can happily drop it.
  2. Status is our target column, and it has no correlation greater than 0.5 with any column except two, viz. the "spread1" and "PPE" columns.
  3. It is our choice whether to eliminate the outliers of a particular column or not, based on how influential it is (via correlation) on our target column.
  4. Since "status", a categorical column, is our interest, and it correlates strongly only with the "spread1" and "PPE" columns, we will try to eliminate outliers only from those columns. But we can change this at any time via the parameters supplied to the function given below.

Target Column Analysis

First, checking the target column to get familiar with how it is given.

There seem to be some outliers in the "spread1" and "PPE" columns, and those outliers are exclusively above the upper quartile (beyond the upper fence); none appear below the lower quartile. That is an important observation about the skew of the data.
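One way to verify that the outliers sit only above the upper quartile is the usual 1.5 × IQR fences; a sketch, assuming the column names `spread1` and `PPE` (the notebook's own outlier cut-off may differ):

```python
# Tukey fences (1.5 * IQR) for the two columns of interest; per the
# observation above, the counts below the lower fence should be zero.
for col in ["spread1", "PPE"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    below = (df[col] < q1 - 1.5 * iqr).sum()
    above = (df[col] > q3 + 1.5 * iqr).sum()
    print(f"{col}: {below} below lower fence, {above} above upper fence")
```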

Let us now go into more detail

Let us do some preparatory work for flexibility and tuning, so that we can arbitrarily switch which columns we choose and how we deal with the features, dropping them as we wish and handling their distributions.

Please note: I have extensively used a functional programming approach here. Better practice would be to use classes, which I didn't do; I plan to in future projects.

First, doing a pairplot to get a sense of the entire data.
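A minimal pairplot sketch with seaborn, hued by the target so class separation is visible; the column subset here is purely illustrative, chosen to keep the figure readable:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots plus per-feature distributions, coloured by diagnosis.
sns.pairplot(df, vars=["MDVP:Fo(Hz)", "HNR", "spread1", "PPE"], hue="status")
plt.show()
```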

Setting up a few functions that will help us eventually.

Univariate Plot Setup

Bivariate Plot Setup
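A rough sketch of what such plotting helpers might look like; the function names and signatures here are my own illustration, not the notebook's actual definitions:

```python
import seaborn as sns
import matplotlib.pyplot as plt

def univariate_plot(frame, column):
    """Histogram (with KDE) plus a boxplot for one feature."""
    fig, (ax_hist, ax_box) = plt.subplots(nrows=2, ncols=1, figsize=(6, 5))
    sns.histplot(frame[column], kde=True, ax=ax_hist)
    sns.boxplot(x=frame[column], ax=ax_box)
    fig.suptitle(column)
    plt.show()

def bivariate_plot(frame, x_col, y_col, hue_col="status"):
    """Scatter plot of two features, coloured by the target column."""
    sns.scatterplot(data=frame, x=x_col, y=y_col, hue=hue_col)
    plt.show()

univariate_plot(df, "PPE")
bivariate_plot(df, "spread1", "PPE")
```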

We are going to set a target column here, along with a correlation threshold for that target column, and create a function that will give us the features whose correlation with the target exceeds that threshold (sketched below).
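A sketch of such a threshold-based selector; the function name is mine, and `status` / 0.5 are the target and threshold used in the observations above:

```python
def features_above_threshold(frame, target="status", threshold=0.5):
    """Names of features whose absolute Pearson correlation with the
    target column exceeds the given threshold."""
    corr = frame.corr(numeric_only=True)[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()

# Per the observations above, threshold=0.5 should surface
# "spread1" and "PPE".
print(features_above_threshold(df))
```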

Now that we have developed highly customizable functions to choose columns/features as we like, for this and any future project, let us set up a similarly customizable approach to outlier identification and elimination.

Moving to actual Analysis here.

Let us now analyze how many outliers there actually are and where they are scattered.

So, there are 98 outliers in total across both columns of interest. We can check any other column by just modifying the arguments. The outliers that overlap in both columns might be very interesting to look at, so we need the index locations of the entries that are outliers in both the "PPE" and "spread1" columns. That might reveal some surprising insights.
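A sketch of how those shared indices can be located, assuming the usual 1.5 × IQR definition of an outlier (the notebook's own cut-off may differ):

```python
def iqr_outlier_index(frame, column, k=1.5):
    """Index labels of rows outside the k * IQR fences for one column."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (frame[column] < q1 - k * iqr) | (frame[column] > q3 + k * iqr)
    return frame.index[mask]

shared_idx = iqr_outlier_index(df, "PPE").intersection(
    iqr_outlier_index(df, "spread1"))
print(len(shared_idx), "rows are outliers in both columns")
```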

Please note: we are not writing a generalized function for this here. We will do that later, since it is much simpler this way, and a general version is not needed yet.

Since these outliers are significant in number, we have to do something about them. There might also be some repeated entries, and some that simply don't fit. We need further analysis and setup to deal with this in detail, so let's get into it.

We can't simply eliminate them, nor can we replace them with average values: doing so would upset their weight and correlation with the other columns, and therefore the overall effectiveness of our chosen model. It is acceptable to replace NaN values with the median or mean, but not extreme values. A fit transformation is also not an option, except as a way to eliminate them.

Let us now try to eliminate the extremities shared between both columns, i.e. values that are extreme in both the PPE and spread1 columns. But before that, let us check how the two are distributed with respect to each other.
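A quick sketch of that joint check, using seaborn's jointplot to show the two features against each other with marginal distributions:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Joint distribution of the two features, with marginal histograms and a
# regression line summarising the trend.
sns.jointplot(data=df, x="spread1", y="PPE", kind="reg")
plt.show()
print(df[["spread1", "PPE"]].corr())
```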

There seems to be a very good correlation between the two features, with considerable overlap. That means they are very highly correlated. Let us recheck this.

There is a huge correlation between the two. So let us now check whether their outliers share the same index values. If so, those outliers have a bigger influence than anything else on the actual prediction.

We should drop the NaN values from the dataframe here onwards to make real use of it.

So eliminating just 13 rows from a dataframe of around 195 entries won't make much difference: a loss of only about 6% of the data points. We are therefore sticking with this dataframe and nothing else, because any further loss of data points is simply not acceptable.

The dataframe now generated has just 182 entries, with the values that were extreme in both the "PPE" and "spread1" columns dropped. Now let us check the correlations and compare them with the previous ones, i.e. before dropping.
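A sketch of the drop-and-compare step, reusing `shared_idx` from the index-intersection sketch above:

```python
# Drop only the rows that are extreme in BOTH columns, then compare the
# correlations of interest before and after.
df_dropped = df.drop(index=shared_idx)
print(df_dropped.shape)  # expect 13 fewer rows than the original

cols = ["spread1", "PPE", "status"]
print(df[cols].corr())
print(df_dropped[cols].corr())
```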

There is not much difference in the correlations even after dropping. It only makes a small difference to status, and there seems to be a slight decrease in the correlation between PPE and spread1.

My personal choice is not to lose too many data points in the hunt for eliminating extreme values. There are no null values or missing entries. We can now go ahead and drop those entries from the table.

We might as well write a function that eliminates outliers for a given set of columns, irrespective of their relationships to other columns, to use later when required. It is not being executed here.
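A sketch of what such a generic eliminator might look like, again under the 1.5 × IQR assumption; the function name is mine:

```python
import pandas as pd

def drop_column_outliers(frame, columns, k=1.5):
    """Copy of `frame` with rows removed that fall outside the k * IQR
    fences in ANY of the given columns, ignoring other relationships."""
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[keep].copy()
```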

We will start by reinitializing everything and bringing it all together.

Note that if we drop one entry, we also lose perfectly legitimate values in every other column of that row. Done aggressively, that would devastate the model and thin the data down dramatically.

Let us now compare the two dataframes: the one with extrema dropped and the original one.

Here we take the difference between the correlation matrices of the two dataframes and compare them. Note that we can always go back and tweak, increasing or decreasing the columns of interest simply by adjusting the correlation threshold and perhaps adding more columns manually.

Now let us take the analysis to the next stage by checking the difference in correlation between the target column and the others.

So, the total outliers contributed by the two columns of interest amount to around 27% of the original data, more than a quarter. Removing all of them would affect our model in ways too unpredictable to be corrected later by adding more data or by more normal data neutralizing them.

Conclusion: we can't proceed with data from which all outliers in either column have been eliminated; we can only proceed to model building with the data cleaned of entries that are outliers in both columns of interest.

Of course, we can modify this at any time.

We are going to bring back the dataframe (data_duplicate) we originally developed, create a copy of it, and initialize it for model training.

Getting Data Ready for Models

Splitting the Data in the ratio 70:30
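A sketch of the 70:30 split, assuming the cleaned dataframe is called `df_final` (a stand-in name for whichever dataframe was chosen above) and stratifying so both classes keep their ratio:

```python
from sklearn.model_selection import train_test_split

# 70:30 split, stratified on the target; "name" is dropped if present.
X = df_final.drop(columns=["status", "name"], errors="ignore")
y = df_final["status"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
```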

Personal opinion: the data entries are so few in number that any model we fit may not yield proper predictions. So I am not sure where this is going to lead.

Note: before we go any further, we might want to consider eliminating outliers and scaling so the data trains properly in the model. I am not going to apply it here, but I will write a function that can do it for us.
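A minimal sketch of such a scaling helper; the function name is mine, and the key design point is fitting the scaler on the training split only:

```python
from sklearn.preprocessing import StandardScaler

def scale_features(x_train, x_test, scaler=None):
    """Fit the scaler on the training split only, then transform both
    splits; fitting on all the data would leak test information."""
    scaler = scaler if scaler is not None else StandardScaler()
    x_train_scaled = scaler.fit_transform(x_train)
    x_test_scaled = scaler.transform(x_test)
    return x_train_scaled, x_test_scaled
```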

Logistic Regression

Observation: the model score doesn't seem bad, but it is not above 90%, so it is merely okay.
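For reference, a sketch of the kind of fit-and-score step behind that number; the hyperparameters here are defaults, not necessarily the notebook's actual choices:

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)  # extra iterations for convergence
log_reg.fit(x_train, y_train)
print("train score:", log_reg.score(x_train, y_train))
print("test score :", log_reg.score(x_test, y_test))
```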

Now we are going for Naive Bayes.

Naive Bayes

Now let us train the data with KNN.

KNN Model

Let us now jump to the SVM model and see how the scores are. We can decide on fit-transforming the data after all the models have been evaluated.

SVM Model

Decision Tree Model

The model seems to be completely overfit, given the very high train score. The data needs further modification to fit the picture.

If graphviz doesn't work, we can use the plot_tree method from sklearn.tree. Make this cell into code:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

feature_train_names = list(x_train)
class_names_chosen = ['No', 'Yes']
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=300)
plot_tree(decision_tree, feature_names=feature_train_names,
          class_names=class_names_chosen, filled=True)
fig.savefig('tree.png')
```

Inference :

  1. As expected, PPE is one of the most important features. Alongside it, surprisingly, are D2 and MDVP:Fo. These are said to be nonlinearly distributed, so we have to keep that in mind; checking their distributions might reveal more important detail here.
  2. Notice also that a number of columns in the middle carry absolutely no importance. Those can be dropped altogether and our model bettered; a sketch of how to list them follows.
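A sketch of reading the importances off the fitted tree (`decision_tree`, as in the plotting cell above); zero-importance features are the drop candidates:

```python
import pandas as pd

# Feature importances from the fitted decision tree, sorted descending.
importances = pd.Series(decision_tree.feature_importances_,
                        index=x_train.columns).sort_values(ascending=False)
print(importances[importances > 0])
print("unused features:", list(importances[importances == 0].index))
```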

We will now try some fit transformations for the rest of the models.

Ensemble Learning

Bagging

AdaBoosting

GradientBoost

RandomForest Classifier

Model comparison and some conditioning

Note: the f1-score has already been calculated and is stored inside classification_report_list. We could access it directly after some proper cleaning, but that takes away all the fun in programming, so I am recomputing it manually here.
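A sketch of that manual recomputation, using `log_reg` from the earlier sketch as a stand-in for any fitted model:

```python
from sklearn.metrics import f1_score, confusion_matrix

# Recompute the f1-score directly instead of parsing it back out of the
# stored classification reports.
y_pred = log_reg.predict(x_test)
print("f1-score:", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```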

Some Comments and Observations

Comments :

I have deliberately not accounted for the fact that some models are definitely overfitting, with perfect train scores, which is completely unusual. It might be resolved if we either:

  1. Eliminate more outliers, which takes just a few simple modifications in the Data Analysis section, a kernel restart, and a rerun of the program.
  2. Fit-transform the data, which can also be chosen at the beginning of the model fitting by simply supplying the method and scaler for the transformation.
  3. Based on the data presented above and the choices made, the high f1 score, high accuracy, and confusion matrix entries suggest the KNN algorithm works best.
  4. One could argue for Bagging on the grounds of its high area under the curve; that call needs to be made again.

Observations :

  1. The amount of data provided is grossly insufficient for any kind of modelling: there are far too few entries for the number of features, and many features correlate far more strongly (greater than 0.9) among themselves than with the actual target variable, "status".
  2. More analysis and methods should be added to check for duplicate, wrong, or nonsensical entries in the input data, but given the number of features present, that takes more resources (computing resources and analysis time) than are available at my disposal.
  3. Some features can certainly be dropped, such as the "HNR" and "MDVP:Fo" columns, which had essentially no correlation with any other feature.
  4. The only categorical column here is the "status" column, which is the target. Predicting a categorical target purely from numeric feature columns tends not to work as well; categorical targets are usually predicted better when categorical features are available.

Further Modifications